Path-BigBird: An AI-Driven Transformer Approach to Classification of Cancer Pathology Reports

PURPOSE Surgical pathology reports are critical for cancer diagnosis and management. To accurately extract information about tumor characteristics from pathology reports in near real time, we explore the impact of using domain-specific transformer models that understand cancer pathology reports. METHODS We built a pathology transformer model, Path-BigBird, by using 2.7 million pathology reports from six SEER cancer registries. We then compared different variations of Path-BigBird with two less computationally intensive methods: the Hierarchical Self-Attention Network (HiSAN) classification model and an off-the-shelf clinical transformer model (Clinical BigBird). We used five pathology information extraction tasks for evaluation: site, subsite, laterality, histology, and behavior. Model performance was evaluated by using macro and micro F1 scores. RESULTS We found that Path-BigBird and Clinical BigBird outperformed the HiSAN in all tasks. Clinical BigBird performed better on the site and laterality tasks. Versions of the Path-BigBird model performed best on the two most difficult tasks: subsite (micro F1 score of 72.53, macro F1 score of 35.76) and histology (micro F1 score of 80.96, macro F1 score of 37.94). The largest performance gains over the HiSAN model were for histology, for which a Path-BigBird model increased the micro F1 score by 1.44 points and the macro F1 score by 3.55 points. Overall, the results suggest that a Path-BigBird model with a vocabulary derived from well-curated and deidentified data is the best-performing model. CONCLUSION The Path-BigBird pathology transformer model improves automated information extraction from pathology reports. Although Path-BigBird outperforms Clinical BigBird and HiSAN, these less computationally expensive models still have utility when resources are constrained.


INTRODUCTION
In 2023 alone, an estimated 1.9 million new cancer cases will be diagnosed in the United States, and an estimated 609,000 cancer-related deaths will occur. 1 To gain insights into cancer incidence and survival, the National Cancer Institute's (NCI's) SEER program collects information via 19 population-wide cancer registries as a primary source for unbiased population-level research. Pathology reports are the primary source of information used for phenotyping tumor cases. Traditionally, cancer registrars manually review pathology reports to extract important phenotypic information from them. 2 However, this manual review has contributed to a significant delay in NCI cancer incidence reporting. 3 Recent advancements in deep learning (DL) and natural language processing (NLP) have made the near-real-time autoextraction of information from pathology reports an achievable goal. 4,5 A Hierarchical Self-Attention Network (HiSAN) model is currently used in production across SEER registries to automatically extract information from approximately 25% of all records. 6 Over the past 2 years, novel developments in the field have presented new opportunities for increasing the proportion of pathology reports that can be autocoded at high accuracy.
The traditional classification model, which operates within a supervised DL framework, is characterized by a fixed architecture that accommodates only the specific tasks it was trained on. This rigid framework poses challenges for adapting the trained model to new prediction tasks. As a result, new models that use the same underlying data must be trained from scratch for each extraction task, thereby increasing computational costs. Large-scale language models, which typically use the transformer architecture, present a solution by extracting inherent patterns in text that can be used for additional supervised or unsupervised learning tasks beyond the initially trained task. Unsupervised learning with pretrained weights captures the underlying patterns in data but is naïve to an outcome. This method contrasts with traditional supervised DL models, in which the weights are produced by using outcome-driven pattern recognition. 7,8 The unsupervised nature of transformers has created an opportunity for general domain transformer models such as BERT 7 and GPT, 9 which are trained on general text corpora. These general domain transformers have created an accessible and accelerated framework for outcome-driven fine-tuning of models for downstream tasks.
In recent years, the development of specialized biomedical and clinical transformers has expanded the applicability of transformer models to the health care domain. 10-13 By training on health care-specific data sets, clinical transformers can capture domain-specific patterns and terminology, thereby enabling them to perform tasks such as medical diagnosis, EHR analysis, clinical text classification, and entity recognition. Notably, the Clinical BigBird model has achieved state-of-the-art performance on longer sequences in medical text by leveraging training with a sparse attention mechanism. 12 In this study, we evaluate the effectiveness of transformer models for information extraction from pathology reports. Because a previous study 14 found that a general domain transformer fine-tuned on pathology reports failed to outperform the benchmark HiSAN model, we revisit the question by developing a domain-specific transformer for pathology reports using the BigBird architecture. To delineate the effects of this domain-specific pathology transformer, Path-BigBird, we also assess the performance of Clinical BigBird, which was pretrained on a more general clinical notes corpus using the same transformer architecture. Accordingly, this study makes three contributions: (1) we introduce an approach for developing the pathology text-specific Path-BigBird model, (2) we test multiple versions of the Path-BigBird model against Clinical BigBird and the HiSAN baseline on a large pathology report data set, and (3) we identify cases in which Path-BigBird outperforms HiSAN and Clinical BigBird and evaluate the implications for future studies.

Data Description
Our data set is composed of electronic cancer pathology reports collected from six SEER registries from 2004 to 2021: Louisiana (LA), Kentucky (KY), Utah (UT), New Jersey (NJ), New Mexico (NM), and Seattle (WA). The combined data set has 2,772,103 pathology reports from 981,944 tumor cases and a total of 878,072 unique patients. Each registry collects cancer pathology reports from histopathology laboratories, and each report describes the tumor and the date the report was generated. Cancer registrars abstract and code information from pathology reports using the International Classification of Diseases for Oncology (ICD-O). 15 These classifications are essential information and aid in cancer incidence tracking at the population level. We consider five classification tasks on the basis of the annotated ICD-O classifications: site, subsite, histology, laterality, and behavior (Fig 1). The site and subsite are based on 3-character and 5-character ICD-O-3 codes, respectively. 16 The histology, laterality, and behavior tasks capture cancer characteristics. 17 This research was approved by the DOE Institutional Review Board and determined to be exempt from informed consent: DOE000152.
We split the annotated pathology data set into three parts for model building: train, validation, and test. To reflect the real-world setting, the test data set was created by selecting the most recently diagnosed tumor cases per registry, which resulted in a test set comprising 15% of the pathology data set: 394,351 pathology reports from 169,663 tumor cases. The remaining records were randomly split into train (70%) and validation (15%) sets. The distribution of records by training split across each cancer registry is shown in Table 1.
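As an illustration, the following is a minimal sketch of this temporal splitting strategy, assuming a pandas DataFrame with hypothetical column names (registry, tumor_id, diagnosis_date); the actual SEER data schema and selection procedure may differ, and here the test fraction is applied per case, whereas the study targets 15% of reports.

```python
# Minimal sketch of the temporal train/validation/test split described above.
# Column names (registry, tumor_id, diagnosis_date) are hypothetical.
import pandas as pd

def temporal_split(df: pd.DataFrame, test_frac: float = 0.15,
                   val_frac: float = 0.15, seed: int = 42):
    """Hold out the most recently diagnosed tumor cases per registry as the
    test set, then randomly split the remaining reports into train/validation."""
    test_parts, rest_parts = [], []
    for _, reg in df.groupby("registry"):
        # Rank tumor cases within the registry by latest diagnosis date.
        cases = (reg.groupby("tumor_id")["diagnosis_date"].max()
                    .sort_values(ascending=False))
        test_ids = set(cases.index[: int(len(cases) * test_frac)])
        test_parts.append(reg[reg["tumor_id"].isin(test_ids)])
        rest_parts.append(reg[~reg["tumor_id"].isin(test_ids)])
    test = pd.concat(test_parts)
    rest = pd.concat(rest_parts).sample(frac=1.0, random_state=seed)  # shuffle
    # val_frac is a fraction of the full data set, so rescale to the remainder.
    n_val = int(len(rest) * val_frac / (1.0 - test_frac))
    return rest.iloc[n_val:], rest.iloc[:n_val], test  # train, val, test
```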

Outcome Labels: Five Tasks
The information extraction tasks consisted of the following categories: site with 70 classes, subsite with 324 classes, behavior with four classes, histology with 626 classes, and laterality with seven classes (Data Supplement, Table S1 shows the number of classes across the training splits). Table 2 shows the top classes for each task and their prevalence across the training splits. The distribution of pathology reports across the different classes highlights the imbalance of the data set. The information extraction labels were annotated by registrars for each tumor case, and all pathology reports for a given tumor case are assigned the same labels. The Data Supplement (Fig S1) shows the distribution of sequence lengths of the pathology reports.

Model Development and Evaluation
In this study, we compare our novel pathology transformer (Path-BigBird) with two different model architectures: the current state-of-the-art classification model (HiSAN) 5 and a clinical transformer (Clinical BigBird). 18 As shown in Figure 2, the three model architectures have substantial differences, although all three use attention mechanisms. 8 HiSAN is a multitask classification model in which the feature extraction layer weights are trained on the outcomes of all five information extraction tasks. 5 By contrast, the transformer models follow a two-stage model building approach: an attention-based pretraining stage, followed by a fine-tuning stage with a classification setup. We compare the models in three categories: data tokenization, pretraining, and fine-tuning.

Data Tokenization
HiSAN and the transformer models take different approaches to data tokenization. The HiSAN classification model uses a word2vec-based embedding, whereas the transformer models use a subword tokenization approach. The word2vec approach is based on a fixed vocabulary, which limits the ability to learn out-of-vocabulary words. Conversely, subword tokenization can handle out-of-vocabulary words by breaking them down into smaller units. We use two types of subword tokenizers for the transformer models: the SentencePiece (SP) tokenizer 19 and the WordPiece (WP) tokenizer. 20 Clinical BigBird uses an SP tokenizer that has been pretrained on MIMIC-III EHRs with a vocabulary size of 50,358 (Fig 2). For Path-BigBird, we introduce three variations. First, Path-BigBird-WP uses a WP tokenizer trained on pathology reports with a vocabulary size of 32,000. Second, Path-BigBird-SP uses the same SP settings as Clinical BigBird, but the tokenizer is trained on pathology reports instead of MIMIC-III. Third, CV-Path-BigBird-SP reuses the SP tokenizer from Clinical BigBird itself. The CV-Path-BigBird-SP model helps us understand how a vocabulary's source affects the model's performance.
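For concreteness, the snippet below sketches how the two tokenizer variants could be trained with the sentencepiece and Hugging Face tokenizers libraries; the corpus path is a hypothetical placeholder, and only the vocabulary sizes follow the text.

```python
# Sketch of training the two subword tokenizers on a pathology corpus.
# "pathology_reports.txt" is a hypothetical one-report-per-line text file.
import sentencepiece as spm
from tokenizers import BertWordPieceTokenizer

# WordPiece tokenizer with a 32,000-token vocabulary (as in Path-BigBird-WP).
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(files=["pathology_reports.txt"], vocab_size=32000)
wp.save_model(".")

# SentencePiece tokenizer (as in Path-BigBird-SP); the vocabulary size
# matches the 50,358 tokens reported for Clinical BigBird.
spm.SentencePieceTrainer.train(
    input="pathology_reports.txt",
    model_prefix="path_bigbird_sp",
    vocab_size=50358,
)
```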

Pretraining
One factor distinguishing transformers from traditional classification models is the pretraining stage, in which the learning of the model weights does not depend on the clinical outcome. A classification model such as HiSAN typically does not have this pretraining stage. The transformer model's pretraining stage is based on an unsupervised task called masked language modeling. 7 For pretraining, Path-BigBird uses the BigBird transformer architecture, which is based on a sparse attention mechanism. 18 In a standard dense attention mechanism, every token attends to every other token, which can become computationally expensive for long sequences. Sparse attention reduces the number of tokens that each token attends to, thereby allowing fast and efficient processing of long sequences and reducing memory requirements without significantly compromising accuracy. Pretraining the pathology transformer models on SEER reports is intended to help the model capture and comprehend the unique language patterns within pathology reports (Fig 2). Clinical BigBird is an off-the-shelf model pretrained on the MIMIC-III EHR data in a previous study. 12 The comprehensive MIMIC-III data set comprises various clinical notes, including progress notes, radiology reports, and other relevant medical documents. By training on this diverse range of EHR notes, the clinical transformer learns the inherent patterns and structures within these types of medical records.
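As a sketch of what the Path-BigBird pretraining stage described above could look like with the Hugging Face transformers library, the snippet below instantiates a BigBird masked language model with block-sparse attention; the tokenizer path and hyperparameters are illustrative assumptions, not the values used in the study.

```python
# Illustrative masked-language-model pretraining setup for a BigBird model.
# The tokenizer path and hyperparameters are assumptions for demonstration.
from transformers import (BigBirdConfig, BigBirdForMaskedLM, BigBirdTokenizer,
                          DataCollatorForLanguageModeling)

# Load the SentencePiece model trained on pathology reports (hypothetical path).
tokenizer = BigBirdTokenizer(vocab_file="path_bigbird_sp.model")

config = BigBirdConfig(
    vocab_size=tokenizer.vocab_size,
    attention_type="block_sparse",   # sparse attention for long sequences
    block_size=64,                   # tokens per sparse-attention block
    max_position_embeddings=4096,    # context long enough for most reports
)
model = BigBirdForMaskedLM(config)

# Masked language modeling: randomly mask 15% of tokens; the model learns to
# reconstruct them, with no dependence on any clinical outcome label.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
```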

Fine-Tuning
Fine-tuning, or downstream training, involves taking a pretrained language model or a classification model and performing supervised learning on one or more classification tasks. 7 The pretrained language model has already learned the patterns and relationships between words in a large corpus of text data, and fine-tuning the model on a specific task involves modifying the final layers of the model to predict the specific class labels for that task.
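To make this concrete, below is a minimal sketch of modifying the final layers for the multitask setup described in the next paragraph: a shared BigBird encoder with one linear classifier per extraction task. It is an illustrative reconstruction using the class counts given earlier, not the authors' exact implementation.

```python
# Sketch of a multitask fine-tuning head: a shared pretrained encoder with a
# separate linear classifier per information extraction task (illustrative).
import torch.nn as nn
from transformers import BigBirdModel

class MultiTaskPathClassifier(nn.Module):
    def __init__(self, encoder_path: str, n_classes: dict):
        super().__init__()
        self.encoder = BigBirdModel.from_pretrained(encoder_path)
        hidden = self.encoder.config.hidden_size
        # e.g., n_classes = {"site": 70, "subsite": 324, "laterality": 7,
        #                    "histology": 626, "behavior": 4}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in n_classes.items()})

    def forward(self, input_ids, attention_mask, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token representation
        logits = {t: head(pooled) for t, head in self.heads.items()}
        if labels is None:
            return logits
        # Multitask objective: sum the per-task cross-entropy losses.
        loss = sum(nn.functional.cross_entropy(logits[t], labels[t])
                   for t in logits)
        return loss, logits
```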
The classification model (HiSAN) and the fine-tuning task for the transformer models are designed as multitask classification models spanning the five pathology information extraction tasks. Previous studies have demonstrated the advantages of multitask classification for pathology report classification. 5,21

Evaluation Metrics
To compare the different models, we use micro F1 and macro F1 scores as evaluation metrics on a holdout test data set. These metrics have been used in similar studies of pathology report classification. 5,22 The micro and macro F1 scores are calculated for each of the five tasks, which are evaluated independently. The micro F1 measures the global performance of the model, whereas the macro F1 gives equal weight to each class and thus provides an overall evaluation of the model's performance across all classes. Confidence intervals were calculated using the normal approximation method, 23 under the assumption that the accuracy of k models estimated with independent random draws from the test holdout data would produce accuracy scores that follow a normal distribution. We also measure the time taken to pretrain the models on the available high-performance computing resources, which enables us to quantify each model's computational needs.
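The following is a minimal sketch of these metrics for one task, using scikit-learn for the F1 scores and the normal approximation for a 95% CI on accuracy (z = 1.96); the function name is ours, and the exact CI procedure in the study may differ.

```python
# Sketch of per-task evaluation: micro/macro F1 plus a normal-approximation
# confidence interval for accuracy. Illustrative, not the authors' exact code.
import math
from sklearn.metrics import accuracy_score, f1_score

def evaluate_task(y_true, y_pred, z: float = 1.96):
    micro = f1_score(y_true, y_pred, average="micro")  # global performance
    macro = f1_score(y_true, y_pred, average="macro")  # equal weight per class
    acc = accuracy_score(y_true, y_pred)
    # Normal approximation: acc +/- z * sqrt(acc * (1 - acc) / n)
    half_width = z * math.sqrt(acc * (1.0 - acc) / len(y_true))
    return {"micro_f1": micro, "macro_f1": macro,
            "acc_95ci": (acc - half_width, acc + half_width)}
```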

RESULTS
We present the performance results and a comparative analysis of HiSAN, Clinical BigBird, and Path-BigBird. Table 3 lists the micro and macro F1 scores for HiSAN, Clinical BigBird, and Path-BigBird (see also Fig 2). The evaluation was conducted on the test split of the pathology data set, which consisted of 394,351 pathology reports (refer to Table 1). Notably, the HiSAN model was run to convergence, whereas the transformer fine-tuning tasks were run for a fixed number of epochs, which is potentially short of convergence.
The transformer models outperform HiSAN in both micro and macro F1 scores across all tasks. The best-performing model for each task is listed in bold font within the tables; for all five tasks, the best-performing model is a transformer. Notably, when comparing the clinical transformer with the pathology transformers, Clinical BigBird has better micro F1 scores for the site task (93.26%) and the laterality task (92.78%). By contrast, the pathology transformer models excel in the subsite (72.53%), histology (80.69%), and behavior (97.92%) tasks. The superior performance of the pathology transformer models in the subsite and histology categories is not surprising because these two tasks involve the most classes and are, therefore, more challenging. Because the pathology transformer models were pretrained and fine-tuned on pathology reports, they can better capture subtle language patterns to differentiate between the various classes.
When comparing the models' macro F1 scores, the CV-Path-BigBird-SP model and Clinical BigBird are the best-performing models. CV-Path-BigBird-SP achieves the highest macro F1 score on the most difficult tasks: subsite with 35.76% and histology with 37.04%. The macro F1 performance on histology showed the largest jump, 3%-4% over the other models. The results indicate that a vocabulary derived from deidentified EHRs benefits pathology report classification. The EHR-based vocabulary is more compact yet diverse during pretraining, which leads to better generalization of the model and improved performance on pathology data extraction tasks (Data Supplement, Table S2 displays micro and macro F1 scores for all tasks across different sequence length groups, providing insight into generalization across varying sequence lengths).
Table 4 shows the class-wise accuracy of CV-Path-BigBird-SP for the different tasks, highlighting the accuracy disparity among classes, especially for the difficult tasks: subsite and histology. Among the least prevalent classes, the F1 score was zero for 12% of subsite classes and 36% of histology classes, which contributes to the low macro F1 scores discussed earlier. We also observe a general decrease in F1 score as prevalence decreases for the site, laterality, and behavior tasks (detailed class-wise accuracy is reported in the Data Supplement, Tables S3-S7).
A model's building time is an essential metric because transformers are compute-intensive models owing to their complex architecture and large number of parameters. Path-BigBird is the most computationally expensive model to build, whereas Clinical BigBird is used off the shelf with pretrained weights, and the HiSAN does not have a pretraining stage. For fine-tuning, Path-BigBird and Clinical BigBird follow identical processes, resulting in similar time requirements (Data Supplement, Table S8 lists the model training times). However, fine-tuning the transformer models takes longer than training the HiSAN because the weights learned during pretraining must be updated along with the classification layers. Despite the computational demands, these findings highlight the value of transformers for capturing complex patterns and relationships, thereby facilitating enhanced performance in pathology information extraction tasks.

DISCUSSION
In this study, we describe the effectiveness of using large language models (LLMs) in the clinical pathology domain.
The study and development of domain-specific transformers have become increasingly important with the widespread availability and ease of use of pretrained general domain language models. These general domain transformers, although powerful in various language tasks, often lack the specialized medical knowledge required for accurate decision making and meaningful insights in clinical data. 24,25 We show that transformer models perform better than our previously published HiSAN model on the five pathology information extraction tasks. 5 Our results suggest that customizing transformer models for specific clinical domains (eg, oncology) is important. Generic off-the-shelf models are generally trained using general clinical data and may fail to capture the nuances and intricacies unique to oncology. Moreover, although using general clinical transformers to extract information from pathology reports may be an effective strategy in a computationally constrained environment, limitations still exist in their ability to capture the subtle language nuances specific to pathology. These subtle nuances play a crucial role in distinguishing classes in more challenging tasks, such as histology and subsite. This highlights the need for domain-specific models to achieve the accuracy required for deployment in health care applications. Thus, although general or clinical off-the-shelf models can be useful, the development of specialized models increases accuracy in health care contexts.
Notably, our study also revealed the importance of the model's tokenizer and vocabulary to pathology report classification. First, we found that the improved performance of the transformer models can be partially attributed to the use of subword tokenizers, because they can handle out-of-vocabulary words by breaking them down into smaller units. Second, we found that CV-Path-BigBird-SP has superior performance because it uses a cleaner, deidentified vocabulary compared with the other pathology transformers. This observation highlights the significant impact of vocabulary definition on the performance of language models. Surprisingly, this EHR-derived vocabulary, although created from a clinical context rather than specifically tailored for pathology, yielded better results (in terms of macro F1 score) than the pathology-specific vocabulary. This raises an important question about how much LLMs rely on the quality of the data they are given to comprehend and process: by ensuring standardized and privacy-preserving data, we can enhance model performance. This highlights the trade-off between domain specificity and data cleanliness when training language models for sensitive clinical tasks. The success of CV-Path-BigBird-SP underscores the value of thorough data preparation for accurate outcomes in clinical NLP. It also encourages further exploration of data cleaning and deidentification techniques to facilitate data privacy and improved performance in health care applications.

Recent advances in LLMs have piqued the interest of researchers and clinicians in the potential benefits of artificial intelligence (AI) for health care. Our previous projects have shown that AI can be used to improve the speed and accuracy of disease reporting at the national scale. 26 With this project, we show that recent advances in LLMs can be adapted to improve the performance of current AI tools. We have demonstrated the power of using a population-wide repository of pathology reports to train a domain-specific transformer. In doing so, we have identified potential opportunities for improving and expanding this research. First, future studies should quantify the data and algorithmic biases that may unequally affect marginalized populations. Although our data are broadly representative of the US population, 27,28 the level of detail and semantics may differ by race, socioeconomic status, and so on. Second, future studies should explore more advanced ways to quantify the uncertainty in each classification. 29 The work presented here is a glimpse of the upcoming AI-powered paradigm shift in cancer care. Although we must be cautious, these tools hold the potential to significantly improve patient care.
Transformer models have emerged as powerful tools for information extraction tasks in cancer pathology reports. These models, including clinical transformers, can capture complex patterns and correlations between words in clinical text. The availability of pretrained language models has facilitated the development of domain-specific clinical downstream tasks. However, when pretraining, we must consider the domain gap between generic text and domain-specific text to ensure reliable performance. Using domain-specific text can contribute to the accurate extraction of relevant information from cancer pathology reports, thereby facilitating improved clinical decision making and patient care. For future work, we plan to further test the model's generalizability and reusability by extending it to other clinically useful tasks, such as biomarker extraction and identification of malignant and metastatic disease.

TABLE 1.
Distribution of Reports, Cases, and Patients in the SEER Cancer Pathology Data Set by Training Split

TABLE 2.
Class-Wise Distribution of Reports for the Five Information Extraction Tasks Across Training Splits. Abbreviations: ML, malignant lymphoma; NOS, not otherwise specified.

TABLE 3.
Macro and Micro F1 Scores of Different Models With CIs

TABLE 4.
Class-Wise Accuracy Across Tasks on the Test Split